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ABSTRACT 


XML (Extensible Markup Language) processing can incur 
significant runtime overhead in XML-based infrastructural 
middleware such as Web service application servers. This 
paper proposes a novel mechanism for efficiently processing 
similar XML documents. Given a new XML document as a 
byte sequence, the XML parser proposed in this paper nor- 
mally avoids syntactic analysis but simply matches the doc- 
ument with previously processed ones, reusing those results. 
Our parser is adaptive since it partially parses and then re- 
members XML document fragments that it has not met be- 
fore. Moreover, it processes safely since its partial parsing 
correctly checks the well-formedness of documents. Our im- 
plementation of the proposed parser complies with the JSR 
63 standard of the Java API for XML Processing (JAXP) 
1.1 specification. We evaluated Deltarser performance with 
messages using Google Web services. Comparing to Piccolo 
(and Apache Xerces), it effectively parses 35% (106%) faster 
in a server-side use-case scenario, and 73% (126%) faster in 
a client-side use-case scenario. 


Categories and Subject Descriptors 
1.7.2 [Computing Methodologies]: Document and Text 


Processing—markup languages, standards; D.3.4 [Software]: 


Processors—run-time environments 


General Terms 


Performance, Design, Experimentation 


Keywords 
XML parsers, SAX, automata 


1. INTRODUCTION 


XML (Extensible Markup Language) [5] processing can be 
a significant runtime overhead in XML-based infrastructural 
middleware such as Web Services application servers. [28, 
11, 15, 25, 8, 18] The good and bad of XML both lie in 
its verboseness. For example, well-formed pairs of named 
marking-up tags for XML elements contribute to its human- 
friendliness and vender-neutrality, but require extra com- 
putation in processing. The computation specific to XML 
includes variable representations for the same tag, the han- 
dling of namespaces, tolerance for multi-character encod- 
ings, etc. 
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This paper proposes a novel mechanism for efficiently pro- 
cessing XML documents in most cases. Its notable feature 
is to remember the documents processed previously and to 
use those processed results to deal with a new XML doc- 
ument. Incoming XML documents that share very similar 
parts with previously-processed documents can be processed 
very quickly with this feature. Even though this feature 
adds some overhead in processing strange XML documents 
(those which are not similar to the documents processed be- 
fore) compared to normal XML processing, the gain from 
the speed-up for similar documents can easily compensate 
for the overhead where there is repetitive processing as in 
situations like Web Services middleware. 

Given a new XML document as a byte sequence, the XML 
parser proposed in this paper, in most cases, does not an- 
alyze most of the XML syntax in the document but just 
compares the byte sequence with the ones that were previ- 
ously processed. The parser then reuses the resultant parse 
events stored during the previous processing. Only the parts 
differing from the previously-processed documents are pro- 
cessed in the normal way for XML parsing. It remembers 
the byte sequences in a DFA (Deterministic Finite Automa- 
ton), where each state transition has a byte sequence and 
its resultant parse event. In addition, the parser remembers 
processing contexts in DFA states so that it can partially 
parse unmatched byte sequences. The parsing process nor- 
mally follows state transitions in the DFA by matching byte 
sequences. If no states to transit to are found, it partially 
parses the unmatched byte sequence until it finds a resultant 
state from which it can transit to existing states. Then it 
continues to make transitions in the DFA. 

The notable feature of our DFA structure is its “safety” 
and “efficiency” in reusing parse events from previously- 
processed XML documents. The safe reuse means that the 
resultant parse events conform to the XML specification [5, 
4]. Note that it must accept well-formed XML documents 
and reject ill-formed ones. Efficient reuse means that it 
should reuse as many parse events as possible from previous 
parsing while minimizing computation costs for the reuse. 
Less reuse can cause extra partial parsing, which is costly, 
and higher reuse computation cost could cancel the benefits 
of reuse. We need to find good trade-offs for efficient reuse, 
since aggressive reuse sometimes increases the computation 
cost for reuse, and vice versa. We carefully designed the 
strategy for constructing DFAs so that they can find a good 
trade-off point for efficiency. 

Several improvements have been created for efficiently 
processing XML documents using application-specific infor- 


mation available from their schemas, such as DTDs. A few 
investigated optimization of lexical analysis [8, 9]. However, 
schema-based techniques cannot capture the optimizations 
from byte-level resemblances among documents, especially 
in prefix-namespace binding and in the representations of 
tags, since they are not specified in schemas. Most of the 
schema-based optimization efforts are focused on higher- 
level XML processing rather than on lexical analysis of XML 
documents serialized as byte sequences. For an example of 
XML document validation, XSM [20] uses cardinality con- 
straint automata specialized by accepting schemas, but it 
assumes SAX events are the input of the automata, not 
low-level byte sequences. 

We implemented in Java an XML parser named Deltarser 
based on the described mechanism. Our SAX implemen- 
tation complies with the JSR 63 [24] (Java API for XML 
Processing 1.1) standard specification. We conducted some 
experiments and observed its advantages over ordinary XML 
parsers in certain representative scenarios which we believe 
cover the majority of practical Web Service usages. Though 
this paper focuses on the SAX implementation, the same 
technology is easily applied to other streaming parsers. In 
fact, our pull-parser implementation complies with the JSR 
173 [13] (Streaming API for XML) standard specification 
and we observed efficiency similar to the SAX implementa- 
tion. 

The rest of the paper is organized as follows: In Section 2, 
we specify the overheads in XML parsing and explain why 
there should be opportunities for reducing these overheads 
in many practical cases. We describe the design and imple- 
mentation of our XML parser, Deltarser, in Section 3 and 
show its effectiveness through the experimental results in 
Section 4. Finally, we conclude this paper with Section 5. 


2. XML PROCESSING OVERHEAD 


XML (Extensible Markup Language) [5] is heavily used 
in XML-based infrastructural middleware such as Web Ser- 
vices [22, 10] application servers, and its processing can be a 
significant runtime overhead. The reason why the processing 
overhead of XML is high lies in its verboseness. 


2.1 XML Processing in Web Services 
Middleware 


The cost of XML parsing has a large impact on the ex- 
ecution performance of recent XML-based middleware [28, 
11, 15, 25, 8, 18]. XML-based middleware analyzes XML 
documents to construct data objects for its application pro- 
grams or for its own use. For example, a Web Service appli- 
cation server for service applications in Java receives SOAP 
messages encoded in XML and deserializes them into Java 
objects while it reads its settings from a file, whose format 
is also defined in XML by JSRs [16, 17]. 

Web Service middleware uses XML heavily since XML 
is a key construct of Web Services. Web Services are en- 
abling technologies for vender-independent distributed en- 
vironments where loosely coupled servers can interconnect 
to each other. In such environments, standard specifica- 
tions define most of things to be written in the XML-based 
languages recommended in the specifications. For example, 
messages passed in Web Services are enveloped according to 
SOAP [22], which is a protocol using XML for messaging. 

In addition to messaging, there are a number of other 
places where XML is used in Web Services middleware. For 
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publishing available Web Services, the services are described 
in WSDL [10] (Web Services Description Language), which 
specifies the format for service descriptions. Configuration 
files like webservice.xml for application servers are often 
described in XML, as defined in JSR 109 [16] and JSR 
921 [17]. 


2.2 Overhead Constructs in SAX Processing 


The computations specifically required to process XML 
documents include tolerance for various character encodings, 
variable length data, ignorable white spaces, line break nor- 
malization, the handling of namespaces, and the creation of 
parsed result objects. Figure 1 shows the normal processing 
model in conventional XML parsers for SAX [24]. 


<n:doc xmins:n=“uri:n1”>\r\n ... \r\n</n:doc> 


bytes 


characters 


QNAME QNAME 
“n:doc” “xmins:n” 


=" VALUE “> 
“uriini” 


ge 


lexical TEXT 
“Arin 
tokens 


StartTag Text 
name="n:doc”, attributes={ ..} text="rin ” 


EndTag 
parser name=“n:doc” 


text="\rin 
events 


StartDocument StartPrefixMapping —_StartElement 
="n", uri=“uri:n1” name=“n:doc”’.. 


SAX 
events 


EndElement EndDocument 
name=“n:doc” 


Figure 1: Processing stages in normal XML parsers 
for SAX. 


First, an XML parser needs to convert the character en- 
codings, since the external encoding, the encoding format in 
which an XML document is encoded, may be different from 
the internal encoding, the encoding format used in a pro- 
gram. For example, an XML document is often encoded in 
UTF-8, while a Java program handles characters in UTF-16. 
An XML parser must convert the original UTF-8 encoded 
characters into UTF-16 encoded characters in order to use 
the characters in the middleware or to pass them to appli- 
cation programs. 

Processing XML documents requires the recognition of 
token delimiters like “<” and “>” [5] since XML data ob- 
jects normally have varying lengths. Like most program- 
ming languages, an XML document must be “parsed” to 
recognize lexical tokens, even though the syntax is relatively 
simple and easy to analyze. This is extra runtime overhead 
compared to efficient wire formats with fixed-length struc- 
tures [6]. 

XML allows ignorable spaces in its tags. For example, a 
start tag: 


<doc language="Japanese"> 
can also be written as: 


< doc language = ’Japanese’ > 


White spaces, tabs, new lines, and carriage returns are al- 
lowed in many places. It means that there are a variety of 
ways to express a tag with the same meaning. An XML 
parser must skip such ignorable spaces. This adds an over- 
head cost since it requires testing whether or not each char- 
acter data is one of these ignorable characters. 

The normalization of line separators is also a task handled 
in an XML parser. The XML specification allows lines to 


be separated by single LF characters, by CR LF character 
sequences, or by single CR characters. However, internally 
these separators are always converted to single LF charac- 
ters. An XML parser must detect LFs, CR LF pairs, and 
CRs to convert all of them to LFs. 

An XML document may use namespaces [4] to avoid col- 
lisions among tag names or attribute names, so an XML 
parser must handle namespaces for documents using them. 
Since XML namespaces are scoped within elements, name- 
space handling involves stacked map management for match- 
ing each namespace prefix to its namespace URI. For exam- 
ple, the nested elements: 


<n:doc xmlns:n="uri:ni"> 
<n:title xmlns:n="uri:n2"/> 
<n:item/> 

</n:doc> 


have the same namespace prefix “n” but they must be re- 
solved to different namespace URIs, uri:n1 for doc and 
item, but to “uri:n2” for title. 

Finally, an XML parser needs to construct parsed result 
objects so that it can pass them to its users (middleware 
or application programs) through an API like SAX. For ex- 
ample, with SAX 2.0 [24], an application subclass overrides 
a method startElement () in the class ContentHandler. A 
SAX parser is responsible for creating String objects for 
its namespace, local name and qualified name parameters, 
and Attributes objects for its attributes parameter to pass 
them through this interface. 


2.3 Opportunities and Challenges in XML 
Processing 


On balance, we believe there should be a lot of opportu- 
nities to optimize the processing of XML documents. Even 
though optimizing for general cases is hard to do, optimizing 
for special cases is possible. For example, if we knew all the 
messages were formed in a certain way, we could optimize 
the parsing process for such XML documents. 

In an environment of Web Service application servers, we 
can expect that most of the messages will be generated by 
machines. In particular, RPC-style request-response mes- 
sages are often generated by middleware with XML serial- 
izers. When accessing Web services in client code, proxy 
classes and frameworks provided by middleware handle all 
of the infrastructure coding. For example, a C# client uses 
the .NET framework to access Web Services, and a JAX- 
RPC client may use the javax.xml.rpc.Call instances im- 
plemented by some particular vendor. 

Though formatting styles are different for various pro- 
gramming languages, implementation vendors, or versions, 
the same XML serializer implementation generates the same 
kind of service requests and responses with different param- 
eters and return-values in very similar byte sequences. This 
is because such XML serialization is performed by a certain 
runtime library or by proxy code generated by a certain tool 
provided by middleware or a development environment. As 
a result, recurring SOAP messages often form very similar 
byte sequences except for the contents of elements repre- 
senting different parameter values, as long as they are sent 
by application programs with the same infrastructural mid- 
dleware. 

A key challenge in optimizing XML processing is to pre- 
serve the interoperability of the XML documents. We can 
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do anything in closed environments using XML as the wire 
format, but it is expected that XML is mainly used in open 
environments, where there may be many participants con- 
forming to the XML specifications [5]. An optimized XML 
parser must not reject or be broken by XML messages in 
forms unexpected by the optimization. 

In addition to the compatibility, another key challenge is 
to make optimizations tolerant of implementation changes 
in infrastructural middleware. An optimization for a limited 
set of XML serializer implementations can easily be obsolete 
and ineffective, since they are easily changed by new imple- 
mentation versions or by new implementation vendors. 


3. DELTARSER 


We designed an XML parser named Deltarser, which can 
efficiently and safely process XML documents similar to pre- 
viously processed documents. The efficiency for similar doc- 
uments applies in situations like Web Service application 
servers, where middleware needs to process a lot of similar 
documents generated by other middleware. In addition, the 
parsing of Deltarser is “safe” in the sense that it checks the 
well-formedness of the processed documents. 

From the viewpoint of users, Deltarser looks just like an 
XML parser implementation and has the same functionality 
as normal XML parsers such as the Apache Xerces [27] im- 
plementation. In fact, the SAX implementation of Deltarser 
complies with JSR 63 [24], which is a standard API for XML 
processing in Java. Applications or middleware using XML 
parsers through the standard API can easily be changed to 
use Deltarser without modifying the code. 

In the best cases, given a new XML document as a byte 
sequence, Deltarser does not analyze most of XML syntax 
in the document, but just compares the byte sequence with 
those that were already processed. The parser reuses the 
processed results stored in memory for the matching parts. 
Figure 2 depicts the abstract view of SAX processing with 
Deltarser. 


<n:doc xmins:n=“uri:n1”>\r\n ... \r\n</n:doc> 

bytes 

differential Matched Different Matched 

information parts parts parts 
Recorded Generated Recorded 
events events events 

SAX StartDocument _ StartPrefixMapping StartElement EndElement EndDocument 


prefix="n”, uri=“uri:n1” name=“n:doc” .. name=“n:doc’ 


events 
Figure 2: Processing stages in Deltarser for SAX. 


The key ideas and technologies we developed for Deltarser 
are as follows: 


Byte-level matching for XML processing It makes use 
of byte-level matching for the most parts of the doc- 
ument to be processed. Since we take the context of 
the XML document into consideration, we can reli- 
ably compare XML documents by only doing byte- 
level matching. 


Remembering processed documents in a DFA For ef- 
ficiently remembering and comparing previously pro- 
cessed documents, it remembers the byte sequences of 
the processed documents in a DFA (Deterministic Fi- 
nite Automaton) structure. Each state transition in 
the DFA has a part of a byte sequence and its resul- 
tant parse event. 


Partial parsing It partially processes XML parsing only 
the parts that differ from the previously-processed doc- 
uments. Each state of the DFA preserves a processing 
context required to parse the following byte sequences. 


Incremental well-formedness checking It reliably che- 
cks the well-formedness of the incoming XML docu- 
ments even though it does not analyze the full XML 
syntax of those documents. Deltarser’s partial XML 
processing for differences checks whether or not the 
entire XML document is well-formed. It retains some 
contextual information needed for that processing. 


In the rest of this section, first, we introduce the funda- 
mental framework of the automaton used to extract differ- 
ences from incoming documents and how to incrementally 
construct the automaton. We go on to explain how to pro- 
cess the partial results. Finally, we describe how to effi- 
ciently notify the applications of the SAX events. 


3.1 Byte-level Matching for XML Parsing 


The main action of Deltarser is byte-level comparison, 
which is much faster than actual parsing. When feeding 
the actual XML document to this state machine, we have 
only to compare the byte sequence of each state and the 
incoming document. 

Applying naive byte-level comparison to XML documents 
gives rise to a serious problem. We might misinterpret the 
document because the byte-level representation of some XML 
fragment is not uniquely interpreted based on the informa- 
tion currently in view of the XML infoset, i.e., the same byte 
sequence can be interpreted with different semantics, which 
depend on the current context. Let us look at the following 
example. 


<x:a xmlns:x="ns1i"> </x:a> 
<x:a xmlns:x="ns2"> </x:a> 


Let us focus on both of the end tags. Although they match 
exactly, they have different semantics in the XML view. This 
is because the namespace declarations are different. 

In order to address this problem, we must take context 
into consideration when matching XML documents at the 
byte level. We rely on Proposition 1 below to process XML 
documents. 

In preparation for presenting the proposition, we define a 
context, C = (E, N, D,1), which consists of E, a sequence of 
elements to which it currently belongs; N, all of the name- 
space definitions that are currently visible; D, all declared 
entities of the document; and l € {0,1,2,3,4,5}, the loca- 
tion in the document, where l = 0 means the beginning of 
the document, l = 1 means either a document declaration 
or a document element has not occurred, l = 2 means a doc- 
ument declaration has occurred but the document element 
has not, l = 3 means it is now within the document element, 
l = 4 means it is now after the document element, and l = 5 
means the end of the document. As a special treatment for 
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the uniqueness of initial and final states, when l = 0 orl = 5, 
we define all the other components in the context as empty. 
This treatment does not cause any problem because at the 
beginning and the end of the document we do not need any 
other information than l for that context. 


DEFINITION 1. Let d be an XML document, and p € Integer 
be the offset in bytes that points to d. By definition, C(p) is 
the context of the point p. 


Let us consider the following XML document. 
<a xmlns:p="xxx"> <b> A </b> </a> 


In this example, the context at the point of A is 
E = {a,b}, N = {(prefix =’ p’, uri =’ xxx’)}, D = {3,1 = 3. 


PROPOSITION 1. Let ev; be an event and C be a context 
that evı belongs to. Provided that the document from a cer- 
tain point, po, matches with the byte sequence of evı for its 
length and C(po) is equal to C1, then the parsed event from 
po must be equal to evı. 


Proposition 1 is naturally inferred from the specification of 
XML [5, 4]. 

Strictly speaking, an external entity reference is resolved 
every time it is about to be parsed, which allows for the 
possibility that the referred entity might have been changed. 
In this paper, however, we do not take this into consideration 
since it is a very unusual case. In addition, SOAP does not 
allow document type declaration. 

In order to simplify the character-level interpretation, we 
assume the character encoding of the document is “state- 
less”. Both UTF-8 (1 byte is 1 octet) and UTF-16 (1 byte 
is 2 octets according to ISO/IEC 10646), which are the only 
encodings XML parsers must support, satisfy this require- 
ment. Stateless here means that 1) one character directly 
corresponds to the certain number of bytes; and 2) an XML 
special character forms a distinct one byte in the encoding. 
For other encodings such as ISO/IEC 2022 variants, we can 
convert them before processing a document (as the normal 
parsers do). In addition, such complex encodings are rarely 
used in XML, and especially for SOAP messages, because 
such encodings cause many problems for interoperability[3]. 
Thus, this approach is not a drawback compared to other 
XML parser implementations. 


3.2 Representing Parsed Documents as an 
Automaton 


In Deltarser, each parse event and its corresponding byte- 
sequence are stored in one edge of an automaton. This au- 
tomaton has two major characteristics: 1) it can directly 
process byte-level events; and 2) it accepts only well-formed 
XML documents. Let us look into how this automaton is 
constructed. 

First, we represent a parsed document as a sequence of 
events. Figure 3 shows the static structure of the event 
classes. 

Each event has a one-to-one correspondence to a fragment 
of the XML document so that it has a byte representation in 
the actual document. Note that a context is updated after 
each event is processed, which consists of a sequence of Start- 
TagEvent and all of the declared entities. From the context, 
we can identify 1) what namespace declarations have been 


1 Context Event 


[-defined-entities byte-sequence 


TagEvent | | TextEvent | [DocTypeEvent] | PIEvent | [CommentEvent] [EntityRefEvent] | 1 


-name Hext 


ji i 
EndTagEvent TagWithAttributesEvent 
attributes 


Attribute 


[name 
[value 


— 


1 * 


EmptyTagEvent StartTagEvent NameSpaceDecl 


Figure 3: Static structure of an event in a UML 
diagram. 


declared so far; 2) the hierarchy of elements where the event 

is located; 3) how to resolve entity references; and 4) which 

part of the document is currently processed. This informa- 

tion is an integral part of how to construct an automata. 
Let us look at the following sample document. 


<p:e xmlns:p="urni">text<x a="ccc" p:b="ddd"/></p:e> 


This is converted to a sequence of events as follows. 


[StartTag: name="e" uri="urn1" 
{Attributes: } 
{NSDecls: (prefix="p", 

(Text: value="text"] 

[EmptyElementTag: name="x" uri="" 

{Attributes: (name="a", uri="", value="ccc") 
(name="b", uri="urni", value="ddd"}] 
[EndTag: name="e" uri="urni"] 


uri="urni") }] 


After converting the document into events, we can con- 
struct an automaton by regarding the context itself as its 
state. In this step, the same contexts are unified into only 
one state. The equivalence of contexts is naturally defined 
as being all of the items in these context are equal, i.e., all 
the element in EF, all of the declared namespaces in N, all 
of the entity definitions in D, and the integer value of l are 
equal. We treat the initial and final states as special. The 
initial state is regarded as the start of the document, i.e., l 
in its context is must be 0; and the final state is regarded as 
the end of the document, i.e., l in that context must be 5. 

Let us look at the process step by step. First, we define 
all the contexts in the document as states in the automaton. 
Next, by tracking the sequence of events that come from the 
actual document, we can connect these states with the event 
as edges of the automaton. In this phase, events that have 
the same byte representations are regarded as equal, and 
we do not add duplicated connections with states, because 
if byte representations and contexts are equal, then the in- 
terpreted events are also equal. Finally, we mark the states 
where l in the state’s context is 0 and 5 as initial and final 
states, respectively. The above example can be represented 
as shown in Figure 4. 


3.3 Matching a Document with an Automaton 


Although Deltarser compares an incoming document at 
byte level, it identifies the differences at the event-level, i.e., 
whenever it finds any discrepancies at the byte level, it in- 
terprets them as event-level differences. 
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>| E={urnl#e}, N={ (prefix="p", uri="urn1")}, D={}, 1=3 } 


/ 


<x a="ccc" p:b="ddd"/>' 


</p:e> 
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Figure 4: Example directed graph that corresponds 
to a parsed document. 


<p:e xmlns:p="urn1">text<y/></p:e> 
Figure 5: A sample XML document. 


Now let us see how a similar document is actually pro- 
cessed. We suppose that we have the state machine shown 
in Figure 4 beforehand, and process the sample document 
shown in Figure 5. 

First, we set the current state to the initial state and 
the current position to the head of the incoming document. 
Since the only possible next state is <p:e xmlns:p="urn1">, 
we compare its byte sequence with the sample document 
from the current position. These are exactly the same until 
the end of the state’s byte sequence, and therefore, we move 
the current state to the next state, 

E = {urnl#e}, N = { (prefix =" p”, uri =” urn1”)}, 
D={},l=3, 

and the current position to the corresponding position. At 
this point, the possible next events are text, 

<x a="ccc" p:b="ddd"/>, or </p:e>. In order to efficiently 
compare these with the incoming document, we use a binary 
search technique. Provided that the byte sequence of these 
events was sorted beforehand, we can quickly find the ap- 
propriate state. In this case, we select text as the accepted 
event, and stay in the same state. 


3.4 Updating an Automaton 


Our parser incrementally modifies the original DFA by 
adding new events made from the byte sequence of a new 
document when it fails to match the incoming byte sequence 
with the DFA. Let us consider this situation by continuing 
with the example. 

On continuing, the possible next events are still the same, 
text, <x a="ccc" p:b="ddd"/>, or </p:e>, neither of which 
matches with the incoming documents byte-sequence, <y/>. 
Therefore we cannot move to any next state in the state ma- 
chine. At this point, we must partially parse the document 
from the last matched position. In this example, we parse 
from the position immediately after text. Hence, we parse 
the document from <y/></p:e> with only one event by using 
the information about the last matched state. Notice that 
we also have to check that the parsed result is well-formed. 
In this case, we obtain a new event: 


[EmptyElementTag: name="y" uri="" {Attributes: }] 


After parsing it, we are able to incrementally add a new 
edge. In this example, the context is still the same even after 
the EmptyElementTag event. Thus, we only have to add a 
connection to the same state, as shown in Figure 6. 

This process eventually causes a new bifurcation in the 
state machine, but it should be stressed that this bifurcation 
leaves the automaton deterministic, because if these byte 
sequences exactly match with each other, these events must 


text 
<p:e xmIns:p="urn1"> <— </p:e> 
oo urn 1#e}, N={(prefix="p", uri="urn1")}, D={}, 3} —_3@ 


<x a="ccc" p:b="ddd"/> 


<y/> 


Figure 6: Adding a new edge as a parse event. 


be exactly matching under the same context so that we do 
not have to add a new edge. 

In practice, since adding edges inevitably consumes some 
memory, we should limit the number of edges for one state 
(i.e. a context). In particular, text events, processor instruc- 
tion events, or comment events all lack “patterns,” so that 
many of those are not worth remembering. Our Deltarser 
can limit the number of text events for each context to a 
certain fixed number, denoted by Mtext. 

Next, we can continue to process the rest of the document 
by byte matching. The next byte-sequence is </p:e> and it 
exactly matches an event in the state machine. Thus, we 
can move to the final state. Since the document ends here, 
we have finished the processing. 

It is clear that we can efficiently process the document 
with byte matching and incrementally process the document 
by following the above procedure. However, is this proce- 
dure safe? In other word, does it really ensure accepting 
only the well-formed documents? We discuss this question 
in the next subsection. 


3.5 Well-formed Automata 


Let us first state the conclusion. The automata con- 
structed by the above procedure accepts a document only 
if it is well-formed. We call such automata well-formed au- 
tomata. This characteristic is essential for the safe process- 
ing. In addition, suppose that the partial parser accepts all 
well-formed documents, and then the automata is extended 
to accept these documents. 

A formal definition of a well-formed automaton is that 
all of the paths in the automaton from the initial state to 
the final state must correspond to a well-formed document. 
Whenever the automaton remains well-formed even after it 
is extended due to incremental processing, then the pro- 
cessed document is certain to be well-formed by definition, 
because it corresponds to a valid path from the resulting 
well-formed automaton. 

We do not give rigorous proof here, but sketch why the 
above procedure assures the automaton remains well-formed. 
Context is a key concept in the proof. Let us suppose we 
have a well-formed automaton beforehand, and the partial 
parser adds some events as edges in the automaton, and 
then we obtain a new distinct path, which consists of the 
sequence of contexts, C1, ..., Cn, and C; and Cn must exist 
in the previous automaton. This condition is guaranteed by 
the fact that there is only one initial and one final state in a 
well-formed automaton, so we will share at least the initial 
and the final state for every well-formed document. 

By Proposition 1, the interpretation of the incoming byte 
sequence as XML at Cı is always the same. Therefore the 
partial parser can check whether or not the incoming byte 
sequence at the point of Cı is well-formed, because all of 
the required information to check it (other than the incom- 
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ing byte sequence itself) is completely defined within the 
context. Strictly speaking, to prove this statement, we have 
to check all of the well-formedness rules in the specification 
of XML [5, 4]. However, since we do not need any informa- 
tion from farther ahead in the document during parsing to 
check the well-formedness (according to the design of XML), 
we can give the well-defined context anyway. Since, accord- 
ing to our definition, the context has all of the information 
that affects the interpretations of incoming byte-sequences, 
it can always be mapped to (some other) well-defined con- 
text. Therefore, we can also regard our definition as a well- 
defined one. 

Suppose the automaton is still well-formed after adding 
Cı, then we create a new context, C2, and proceed to parse 
the next incoming byte sequence, and then check if it is well- 
formed in C2. Inductively, we can reach Cn that comes from 
the previous well-formed automaton, and after Cn, since the 
interpretation of the byte sequence is exactly the same, all of 
the paths after it must correspond to a part of a well-formed 
automaton. Therefore all of the new paths in the extended 
automaton correspond to well-formed documents. 

As for the first well-formed automaton, we can construct it 
by simply connecting from the initial state to the final state 
with the empty element tag event of the document element. 
Therefore, the above procedure insures the automaton is 
well-formed. 


3.6 Partial Parsing 


When the incoming byte-sequence does not match with 
any events in the state, we have to partially parse the docu- 
ment. We require a context in order to parse the document 
from an intermediate point, which is available in each state 
of the state machine. The partial parser parses the byte- 
sequence for one event and updates the context if required. 

The partial parser has few disadvantages compared to the 
typical non-partial parsers. We can instantly continue to 
parse a document since the context has all of the essential 
information. 

Note that we have to provide a special treatment for gen- 
eral entity references. According to the XML specifica- 
tion Section 4.3.2 [5], if all of the entities are well-formed, 
each logical and physical structure must be properly nested. 
Therefore we can regard the replacement text of such an 
entity reference as one event. That is, it does not alter the 
context after the entity reference if it is well-formed. Even- 
tually, the partial parser only has to treat an entity reference 
as a distinct event, and check that its replacement text fol- 
lows the constraints of well-formedness. 


3.7 Notifying SAX Events 


While processing an XML document, Deltarser sends SAX 
events to an application as its output. To minimize the 
computation cost for reuse by avoiding object construction 
as much as possible, it prepares SAX event objects in their 
final form when remembering the processing results to avoid 
object construction. 

For example, a TagWithAttributesEvent instance has an 
object that implements the Attribute interface required for 
SAX events. Since this object is immutable, we can safely 
reuse it without duplicating it. StartTagEvent and End- 
TagEvent have arrays for a startPrefixMapping SAX event 
and a endPrefixMapping SAX event, respectively, both of 
which are reusable. In contrast, we cannot reuse character 


arrays in TextEvent for characters SAX events because ap- 
plications may alter them. Thus, Deltarser duplicates them 
to send to the applications. 


4. PERFORMANCE EVALUATION 


This section describes a performance evaluation of our 
prototype implementation. First, we conducted some fun- 
damental performance analysis experiments. Then we ap- 
plied it to real-world application scenarios of XML parsing 
for Web services middleware. 

We did performance comparisons between our parser and 
two other XML parsers: Apache Xerces [27] and Piccolo [30]. 
Xerces is one of the most popular parsers and Piccolo is re- 
garded as one of the fastest SAX parsers. The experimental 
environment was using Sun Hotspot Server VM 1.4.2 on 
Linux Kernel 2.4.18-14, running on an IBM IntelliStation M 
Pro 6850-60J (Intel Xeon 2.4GHz CPU with 512KB cache) 
with 2GB of memory. 

In this experiment, we set the number of remembered text 
events in each context (defined as Mrest in Section 3.4) to 
1. This value is determined by our observation that most 
texts fall into two classes, those that never change (so one 
string is enough) and those that change almost every time 
they appear (so there is no benefit from saving them). 


4.1 Fundamental Performance Analysis 


First, a fundamental performance analysis was conducted 
by comparing the total elapsed time with the other XML 
parsers in two cases. In the first case, the incoming docu- 
ment is highly similar to the previously parsed document, 
so the parser can skip the parsing and generation of SAX 
events for a large portion of the XML message. In con- 
trast, the second case is when the incoming document is 
completely different from the previously parsed documents 
so that the performance of our parser includes the total of 
the time for parsing, for creating new automata states, and 
for generating the SAX events. 

We excluded the compilation time of the JIT compiler 
from the measurements by running the benchmarking pro- 
gram for 10,000 iterations before the measurements. In Web 
Service middleware, it is expected that the stable running 
state continues unchanged for a long time. Therefore, it is 
more appropriate to measure the execution time after the 
JIT compiler has completely compiled the code. Then we 
measure the total time to perform 10,000 iterations of pars- 
ing XML documents to estimate the average time for parsing 
one document. 


The Cost of Matching 


In this way we can measure the performance when a large 
number of XML documents with similar structures are ar- 
riving. In other words, all of the start tags and end tags 
are matched, but the text content may vary. We performed 
the experiment in two cases: the case where all of the text 
content is the same, and the case where the text content is 
always different. 

For Web Services, the structure of the request XML doc- 
uments is defined in WSDL. Therefore, request documents 
for the same operation are very similar if the documents are 
generated by the same implementation. For example, re- 
quests for a certain search operation are likely to differ only 
in the text content. Hence, in this experiment, we changed 
just the text content for the incoming documents. 
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Figure 7: Average elapsed times for parsing an XML 
document. 


Figure 7 shows the results of our experiment. In this 
experiment, some XML documents which include Google 
search responses are used. We generated some XML doc- 
uments of various sizes, from 1 KB to 64 KB, by changing 
the number of search results. Then we measured the elapsed 
time for 10,000 iterations, and the average time for one pars- 
ing is shown in Figure 7. 

“Deltarser (complete match)” and “Deltarser (element only 
match)” show the complete match and the case when all of 
text content has changed, respectively. When our parser re- 
ceives a similar document, our parser is approximately 90% 
faster than Xerces and 70% faster than Piccolo. 


The Cost of Creating New Automaton Paths 


Comparing to partially parsing text content, parsing ele- 
ments is costly in Deltarser, since this creates new states in 
the DFA. These costs are in proportion to the number of 
elements that must be partially parsed. 

In order to estimate the cost of partial parsing involving 
state creation in a DFA, we measured the elapsed time for 
applying partial parsing to a full XML document. We used 
1 KB Google search response messages, which is the same 
message used in the previous experiment. 

We also wanted to estimate in how many times Deltarser 
can compensate for the penalty of partial parsing by pro- 
cessing similar documents. Therefore we parsed many struc- 
turally similar documents with different text contents. This 
parsing process is identical to “Deltarser (element only match)” 
in the previous experiments. 

Figure 8 shows the cumulative parsing times for a group 
of similar documents. The x-axis is the number of similar 
XML documents that were parsed. 

For partially parsing and creating full states for an XML 
document, it takes 2.32 ms, whereas Xerces parsing takes 
0.14 ms and Piccolo parsing requires 0.08 ms. We can see 
that Deltarser overtakes Piccolo with regard to the total 
parsing time before 50 similar documents are processed. 


Memory Usage 


We also evaluated memory usage for the number of memo- 
rized XML documents. For this experiment, we temporarily 
modified our parser to force the automaton to store even the 
same XML document as a new path. Then we measured the 
memory usage when it stored many XML documents. Fig- 
ure 9 shows the memory usage when many 1 KB and 5 KB 
XML documents are stored. The result suggests that the 
memory consumption is acceptable in such an environment. 
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Figure 8: Cumulative parsing costs for a group of 
similar documents. 
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Figure 9: Memory consumption for storing docu- 
ments. 


A key assumption is that our parser is intended for server- 
side machines with relatively large memories. We believe 
that the result in Figure 9 shows that the memory consump- 
tion is acceptable in such an environment. Even though 64 
MB of heap memory are used for 800 documents, it should 
be rare that one parser needs to handle hundreds of different 
kinds of XML documents without any similarities. However 
it would be good to have a function for discarding relatively 
ineffective automaton paths. We could employ some well- 
known cache replacement algorithm such as the LRU (Least 
Recently Used) algorithm to do so. 


4.2 An Application Benchmark 


We evaluated Deltarser through scenarios based on a real- 
world application, the Google Web APIs [14]. This is a 
Web service that allows software developers to access a set 
of services provided by Google via the SOAP and WSDL 
standards. We provided scenarios for server-side and client- 
side services. 


The Google Web Services 


Google Web Services has three services: searching, returning 
a cached page, and spelling suggestions. The corresponding 
request operations received by the servers are doGoogle- 
Search, doGetCachedPage, and doSpellingSuggestion, and 
the corresponding response operations received by the clients 
are doGoogleSearchResponse, doGetCachedPageResponse, 
and doSpellingSuggestionResponse, respectively. 

Note that the properties of the incoming SOAP messages 
are different between the server side and the client side as 
regards Deltarser performance. In this scenario, a server 
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tends to receive different formats of incoming SOAP mes- 
sages from various clients that serialize the messages with a 
variety of SOAP implementations such as .NET, Axis, and 
so forth. Meanwhile, the client tends to send its request to 
only one SOAP container and then receives XML messages 
in a specific format serialized by a single serializer imple- 
mentation. 

For testing the server-side scenario, mixed messages of the 
three request operations were used. We assumed that the 
searching service is used more frequently (60%) than the 
other services (20% for each). For the client-side scenario, 
our experiment used only response messages for the search- 
ing service. This client scenario is reasonable since a client 
does not need to use all of the services. 


Message Serializers 


We used two serializer implementations to generate SOAP 
messages: Apache Axis and Microsoft .NET Framework. 
These two serializers generate SOAP messages with different 
namespace prefixes and different white spaces for identical 
operations. 

We provided two sets of messages to parse. The first set 
includes only the Axis version of the request messages. It re- 
sults in three “different” message groups, which would make 
Deltarser to generate three distinct paths of state transitions 
in its DFA. The second set includes both Axis and .Net ver- 
sions, resulting in six different message groups. 

We experimented with two sets of messages on the server- 
side scenario to see the impact of the numbers of serial- 
izer variations. For the client-side scenario, we only experi- 
mented with the first, Axis-only set of messages. 


Parameter Values 


The parameter values of the three operations in the request 
and response messages may vary for each message. For ex- 
ample, the query keyword parameter value varies frequently 
while the encoding parameter value remains in UTF-8. We 
assumed that 4/10 of the parameter values in the search- 
ing operation request vary frequently and that all of the 
parameter values in the other two operations requests vary 
frequently. (A total of 57% of the parameter values vary in 
the server-side scenario.) Similarly, we assumed that 11/18 
of the parameter values in the search responses varied fre- 
quently. (A total of 61% of the parameter values varied in 
the client-side scenario.) 


Experimental Results 


Figure 10 shows the throughputs of the parsers in the steady 
state for each scenario. For reference and to see the best 
and worst cases of parameter value variability, we measured 
unrealistic message scenarios for “Deltarser (element only 
match)” and “Deltarser (complete match)”, using the same 
idea as in Section 4.1. Also for reference, we measured 
two unrealistic server-side scenarios of “Server-side (.Net)” 
and “Server-side (Axis)”. Each of these scenarios uses only 
single-vender messages. 

In the server-side scenarios of “Server-side (.NET and 
Axis)”, Deltarser parses 106% faster than Xerces and 35% 
faster than Piccolo. In the client-side scenario, Deltarser 
parses 126% faster than Xerces and 73% faster than Pic- 
colo. In addition, we do not see any significant negative 
effects from the increased number of serializer implementa- 
tion variations. 
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Figure 10: Average throughputs for 100,000 docu- 
ments in the server/client-side scenarios. 
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Figure 11: Elapsed times for parsing XML messages 
in the server-side scenario. 


Figure 11 shows the elapsed times for parsing the XML 
messages before the parsers get to their steady state in 
the server-side scenarios. Since the client-side scenario also 
shows a similar pattern, we did not present those results. 

Figure 12 shows the consumed heap memory size in the 
server-side scenarios. We observed an approximately 50 KB 
increase for additional serializer implementation variations 
by comparing “Deltarser (.NET and Axis)” to “Deltarser 
(.NET)” or “Deltarser (Axis)”. 

In our experiments, the creation time for new automaton 
paths is included in the runtime execution time. However 
the cost of creation is relatively large. It is possible to ex- 
clude this overhead from the runtime execution time. For 
example, we can update the automaton in a separate thread 
process, or in a batched process at a time of low system load. 
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5. RELATED WORK 


5.1 Difference Extraction 


The problem of computing similarities between two files 
has been studied extensively. Efficient difference extraction 
implementation is available in many programs like UNIX’s 
diff command. However, it is not straight-forward to apply 
it for computing the similarity of a message against multiple 
messages. Moreover, we need to find similar parts extracted 
from multiple messages, rather than finding a single, most 
similar file. 

Manber studied how to efficiently find similar files in a 
large file system [21] and presented a tool called sif. While 
our parser directly remembers the previously-processed doc- 
uments in a DFA, sif partitions files into small parts and 
then remembers their hash values computed based on their 
textual representation. These hash values are used to look 
up small parts of files in the file system which is identical 
to a small part of a newly-given file. The purpose of these 
hash values is to estimate the overall similarity of each file 
in the file system to the new file. 

Deltarser, however, must take account of processing con- 
text for the equality of two document fragments in addition 
to textual representation. Though a similar hashing tech- 
nique is used for looking up identical contexts in Deltarser, 
it just compares two byte sequences for looking up identical 
document fragments. 


5.2 Protocol Filters 


Efficient network packet filters/classifiers [29, 2] often use 
pattern matching in order to determine appropriate filters 
applied to incoming packets. Since packet formats are not 
verbose, they do not need to optimize for sender-specific 
packet formats. 

XML schema-based optimization techniques [20, 8, 9] cor- 
respond to the packet filtering techniques in terms of XML 
processing because packet formats can be regarded as sche- 
mas for packets. Both do not optimize for variable wire 
representation of XML messages or network packets. 


5.3 Delta-encoding 


Techniques for compressing a data object like a file relative 
to another object is called delta-encoding [1, 19], and have 
been applied to network data transmission [23, 7, 26, 12], 
just to mention a few. Since delta-encoded XML documents 
have explicit differential information for previously-received 
documents, it is more straight-forward to reuse the results 
of previous processing. 

However, in order to use delta-encoding, document pro- 
ducers and consumers must make some agreements with 
each other for encoding methods, which are application spe- 
cific. Without standard specifications and recommendations 
for the encoding format, the original content of XML doc- 
uments cannot extract from the encoded documents. Such 
XML documents are no more interoperable in terms of the 
standard specifications. 


6. CONCLUDING REMARKS 


This paper proposed an XML parser named Deltarser, 
which has a novel mechanism for efficiently processing XML 
documents in most cases of XML-based infrastructural mid- 
dleware like Web Services application servers. Given a new 


XML document in a byte sequence, the XML parser pro- 
posed in this paper usually does not analyze the XML syn- 
tax but just compares the byte sequences with those which 
have already been processed. It reuses the processed results 
stored before. 

The SAX implementation of our parser complies with the 
JSR 63 standard, and the Java API for XML Processing 
(JAXP) 1.1 specification. In an experiment with a message 
for Google Web Services, it parses 126% faster than Apache 
Xerces and 73% faster than Piccolo at its best. 
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